refactor ops processing with output_feature_hints method #986
Open
cmgzn wants to merge 2 commits into
Code Review
This pull request introduces a mechanism to declare partial output feature hints (output_feature_hints) for operators, resolving schema inference issues in HuggingFace when early batches contain empty lists or ambiguous types. The feedback points out a critical issue in _merge_feature_dicts where recursively merging incompatible types (such as a Sequence and a Struct, which both inherit from Mapping) can corrupt the feature structure, and provides a code suggestion to ensure type compatibility before merging.
Summary
Add output_feature_hints for HuggingFace-backed NestedDataset.map.

Some operators write nested list fields whose first writer batch can be all empty lists. HuggingFace Datasets may infer those values as list<null>, then fail when later batches return concrete nested values such as list<list<int64>> or list<list<float32>>.

This PR lets operators provide partial output feature hints. NestedDataset.map merges those hints into the current dataset features and forwards the merged schema to Dataset.map(features=...), so the Arrow writer does not infer ambiguous empty-list fields as null.

Changes
- NestedDataset.map(..., output_feature_hints=...).
- OP.output_feature_hints(input_features) as the operator-level schema hint hook.

Why
This moves schema disambiguation from ad-hoc return-value shaping inside operators
to an explicit dataset-level mechanism.
Operators can return natural empty values such as [], while still telling HuggingFace/Arrow the intended concrete output type before map cache batches are written.
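
To illustrate the merge step, here is a minimal sketch that uses plain Python dicts as stand-ins for feature schemas (a dict is a struct, a one-element list is a list type, a string is a dtype). The function name merge_feature_hints and the schema encoding are assumptions made for this example and are not the code in this PR. Only dict-with-dict pairs are merged key by key; any other pairing is replaced by the hint rather than merged recursively.

```python
from copy import deepcopy


def merge_feature_hints(current, hints):
    """Return a new schema in which partial `hints` override `current`.

    Hypothetical helper: schemas are plain dicts here, not HuggingFace
    Features objects.
    """
    merged = deepcopy(current)
    for name, hint in hints.items():
        existing = merged.get(name)
        if isinstance(existing, dict) and isinstance(hint, dict):
            # Both sides are struct-like: merge field by field.
            merged[name] = merge_feature_hints(existing, hint)
        else:
            # Different kinds (or a new field): the hint replaces the old entry.
            merged[name] = deepcopy(hint)
    return merged


# "bboxes" was inferred from an all-empty first batch as a list of null.
current = {"text": "string", "meta": {"id": "int64", "bboxes": ["null"]}}
# The operator only declares the field it knows about.
hints = {"meta": {"bboxes": [["float32"]]}}

merged = merge_feature_hints(current, hints)
```

Fields that the hints do not mention (text and meta.id above) are carried over unchanged, which is what makes partial hints sufficient.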
Behavior and Compatibility
This is opt-in. Operators that do not implement output_feature_hints() keep the previous HuggingFace inference behavior.
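
The opt-in behavior could look roughly like the sketch below. BaseOp, BoxMapper and resolve_map_features are hypothetical names for this example, and the schema is again a plain dict with a shallow merge, so this is not the actual data_juicer implementation. The point is only that an operator without hints yields None, which corresponds to the default features=None of Dataset.map (schema inferred from the data).

```python
class BaseOp:
    """Stand-in base operator: by default it declares no output hints."""

    def output_feature_hints(self, input_features):
        return None


class BoxMapper(BaseOp):
    """Stand-in operator that writes a nested list of float boxes."""

    def output_feature_hints(self, input_features):
        # Only the field this operator adds is declared.
        return {"bboxes": [["float32"]]}


def resolve_map_features(op, input_features):
    """Return the schema to forward to map, or None to keep inference."""
    hints = op.output_feature_hints(input_features)
    if not hints:
        return None
    # Shallow merge for brevity; nested fields would need a recursive merge.
    return {**input_features, **hints}


no_hint_schema = resolve_map_features(BaseOp(), {"text": "string"})
hinted_schema = resolve_map_features(BoxMapper(), {"text": "string"})
```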
For operators that do provide hints, HuggingFace Datasets will cast mapped values to the declared features. This is intentional, but it means the declared feature type must match the actual returned values. For example, bbox coordinates declared as float32 may be stored with float32 precision.

There is also a subtle distinction between preserving old schema workarounds and changing output semantics. Some operators used sentinel values such as zero-filled boxes to avoid empty-list schema inference. Replacing those sentinels with [] is semantically cleaner, but changes exported data from a one-box zero sentinel to an actually empty list. Downstream code should handle both forms before such sentinels are removed.
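
As a rough illustration of handling both forms downstream, the helper below treats a single all-zero box as equivalent to an empty list. The name normalize_boxes and the assumption that the sentinel is exactly one box of four zero coordinates are made up for this example; the real sentinel shape depends on the operator.

```python
def normalize_boxes(boxes, box_len=4):
    """Map the legacy one-box zero sentinel and [] to the same empty list.

    Hypothetical helper: assumes the sentinel is a single box whose
    `box_len` coordinates are all zero.
    """
    is_sentinel = (
        len(boxes) == 1
        and len(boxes[0]) == box_len
        and all(value == 0 for value in boxes[0])
    )
    if is_sentinel:
        return []
    # Copy so callers can mutate the result without touching the input.
    return [list(box) for box in boxes]


legacy = normalize_boxes([[0.0, 0.0, 0.0, 0.0]])
cleaned = normalize_boxes([])
real = normalize_boxes([[0, 0, 0, 0], [1, 2, 3, 4]])
```

A zero box that appears alongside other boxes is kept as-is, since only the lone zero box is assumed to be a sentinel.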
Tests
python -m unittest tests.core.data.test_dj_dataset.TestNestedDataset.test_map_output_feature_hints_allow_empty_nested_list_first_batch
python -m py_compile data_juicer/core/data/dj_dataset.py data_juicer/ops/base_op.py data_juicer/ops/mapper/imgdiff_difference_caption_generator_mapper.py data_juicer/ops/mapper/imgdiff_difference_area_generator_mapper.py
git diff --check